Introduction to Open Data Science - Course Project

About the project

I first noticed this course on our doctoral students' mailing list and thought it sounded interesting. I think Open Data and Open Science are both important aspects of good modern research, which is why I enrolled in this course to learn more about them. Additionally, I hadn't used R or RStudio, let alone GitHub, before this course, so I thought I could learn to use these useful tools as well. So far the course has seemed very well organized, and I'm looking forward to learning more!

Reflecting on my learning experience

I thought that the R for Health Data Science book together with Exercise Set 1 was an excellent way to introduce us to the basics of RStudio quickly and clearly. I especially liked the exercise set, because it gave a nice visual demonstration after first reading about the different tools and commands in the book. I don't know if any particular topic was more difficult than the others; rather, the challenge was learning the logic of R and applying it for my own purposes.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec 12 19:16:15 2022"

Click here to see my GitHub repository.



Assignment 2

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec 12 19:16:15 2022"

1. Part

The data I’m using in my analysis is derived from the larger “JYTOPKYS2” dataset. This dataset was collected via a survey of statistics students and was used to evaluate the effects of learning approaches on exam results.

Seven variables from the original dataset were chosen for this data (gender, age, attitude, deep learning, strategic learning, surface learning and exam points). Additionally, students who scored 0 in their exam were excluded from this data.
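As a minimal sketch of the exclusion step described above (using made-up values, not the author's actual wrangling script):

```r
# Hypothetical mini data frame; only the 'points' column matters here
d <- data.frame(points = c(25, 0, 12, 0, 24))

# Keep only students with a non-zero exam score
d_kept <- subset(d, points > 0)
nrow(d_kept)  # 3 rows remain
```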

The variable attitude in learning2014 is the sum of 10 questions related to students’ attitude towards statistics, each measured on the Likert scale (1-5).

The variables deep (deep learning), stra (strategic learning) and surf (surface learning) in learning2014 are the mean values of the combinations of questions that relate to each learning approach. Each question was again measured on the Likert scale (1-5).
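A hedged sketch of how such combination variables are typically built (the question-column names q_d1, q_d2, q_d3 are hypothetical, not the real JYTOPKYS2 item names):

```r
# Toy data: three hypothetical deep-learning questions on a 1-5 Likert scale
raw <- data.frame(q_d1 = c(4, 3, 5),
                  q_d2 = c(3, 2, 5),
                  q_d3 = c(5, 4, 5))

# The combination variable is the row-wise mean of the related questions
raw$deep <- rowMeans(raw[, c("q_d1", "q_d2", "q_d3")])
raw$deep  # 4 3 5
```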

In the following R chunk I will explore the structure and dimensions of the data further.

# reading the data into memory
learning2014 <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/learning2014.csv", sep = ",", header = TRUE)

#displaying dataframe
learning2014
##     gender age attitude     deep  stra     surf points
## 1        F  53       37 3.583333 3.375 2.583333     25
## 2        M  55       31 2.916667 2.750 3.166667     12
## 3        F  49       25 3.500000 3.625 2.250000     24
## 4        M  53       35 3.500000 3.125 2.250000     10
## 5        M  49       37 3.666667 3.625 2.833333     22
## 6        F  38       38 4.750000 3.625 2.416667     21
## 7        M  50       35 3.833333 2.250 1.916667     21
## 8        F  37       29 3.250000 4.000 2.833333     31
## 9        M  37       38 4.333333 4.250 2.166667     24
## 10       F  42       21 4.000000 3.500 3.000000     26
## 11       M  37       39 3.583333 3.625 2.666667     31
## 12       F  34       38 3.833333 4.750 2.416667     31
## 13       F  34       24 4.250000 3.625 2.250000     23
## 14       F  34       30 3.333333 3.500 2.750000     25
## 15       M  35       26 4.166667 1.750 2.333333     21
## 16       F  33       41 3.666667 3.875 2.333333     31
## 17       F  32       26 4.083333 1.375 2.916667     20
## 18       F  44       26 3.500000 3.250 2.500000     22
## 19       M  29       17 4.083333 3.000 3.750000      9
## 20       F  30       27 4.000000 3.750 2.750000     24
## 21       M  27       39 3.916667 2.625 2.333333     28
## 22       M  29       34 4.000000 2.375 2.416667     30
## 23       F  31       27 4.000000 3.625 3.000000     24
## 24       F  37       23 3.666667 2.750 2.416667      9
## 25       F  26       37 3.666667 1.750 2.833333     26
## 26       F  26       44 4.416667 3.250 3.166667     32
## 27       M  30       41 3.916667 4.000 3.000000     32
## 28       F  33       37 3.750000 3.625 2.000000     33
## 29       F  33       25 3.250000 2.875 3.500000     29
## 30       M  28       30 3.583333 3.000 3.750000     30
## 31       M  26       34 4.916667 1.625 2.500000     19
## 32       F  27       32 3.583333 3.250 2.083333     23
## 33       F  25       20 2.916667 3.500 2.416667     19
## 34       F  31       24 3.666667 3.000 2.583333     12
## 35       M  20       42 4.500000 3.250 1.583333     10
## 36       F  39       16 4.083333 1.875 2.833333     11
## 37       M  38       31 3.833333 4.375 1.833333     20
## 38       M  24       38 3.250000 3.625 2.416667     26
## 39       M  26       38 2.333333 2.500 3.250000     31
## 40       M  25       33 3.333333 1.250 3.416667     20
## 41       F  30       17 4.083333 4.000 3.416667     23
## 42       F  25       25 2.916667 3.000 3.166667     12
## 43       M  30       32 3.333333 2.500 3.500000     24
## 44       F  48       35 3.833333 4.875 2.666667     17
## 45       F  24       32 3.666667 5.000 2.416667     29
## 46       F  40       42 4.666667 4.375 3.583333     23
## 47       M  25       31 3.750000 3.250 2.083333     28
## 48       F  23       39 3.416667 4.000 3.750000     31
## 49       F  25       19 4.166667 3.125 2.916667     23
## 50       F  23       21 2.916667 2.500 2.916667     25
## 51       M  27       25 4.166667 3.125 2.416667     18
## 52       M  25       32 3.583333 3.250 3.000000     19
## 53       M  23       32 2.833333 2.125 3.416667     22
## 54       F  23       26 4.000000 2.750 2.916667     25
## 55       F  23       23 2.916667 2.375 3.250000     21
## 56       F  45       38 3.000000 3.125 3.250000      9
## 57       F  22       28 4.083333 4.000 2.333333     28
## 58       F  23       33 2.916667 4.000 3.250000     25
## 59       M  21       48 3.500000 2.250 2.500000     29
## 60       M  21       40 4.333333 3.250 1.750000     33
## 61       F  21       40 4.250000 3.625 2.250000     33
## 62       F  21       47 3.416667 3.625 2.083333     25
## 63       F  26       23 3.083333 2.500 2.833333     18
## 64       F  25       31 4.583333 1.875 2.833333     22
## 65       F  26       27 3.416667 2.000 2.416667     17
## 66       M  21       41 3.416667 1.875 2.250000     25
## 67       F  23       34 3.416667 4.000 2.833333     28
## 68       F  22       25 3.583333 2.875 2.250000     22
## 69       F  22       21 1.583333 3.875 1.833333     26
## 70       F  22       14 3.333333 2.500 2.916667     11
## 71       F  23       19 4.333333 2.750 2.916667     29
## 72       M  22       37 4.416667 4.500 2.083333     22
## 73       M  23       32 4.833333 3.375 2.333333     21
## 74       M  24       28 3.083333 2.625 2.416667     28
## 75       F  22       41 3.000000 4.125 2.750000     33
## 76       F  23       25 4.083333 2.625 3.250000     16
## 77       M  22       28 4.083333 2.250 1.750000     31
## 78       M  20       38 3.750000 2.750 2.583333     22
## 79       M  22       31 3.083333 3.000 3.333333     31
## 80       M  21       35 4.750000 1.625 2.833333     23
## 81       F  22       36 4.250000 1.875 2.500000     26
## 82       F  23       26 4.166667 3.375 2.416667     12
## 83       M  21       44 4.416667 3.750 2.416667     26
## 84       M  22       45 3.833333 2.125 2.583333     31
## 85       M  29       32 3.333333 2.375 3.000000     19
## 86       F  29       39 3.166667 2.750 2.000000     30
## 87       F  21       25 3.166667 3.125 3.416667     12
## 88       M  28       33 3.833333 3.500 2.833333     17
## 89       F  21       33 4.250000 2.625 2.250000     18
## 90       F  30       30 3.833333 3.375 2.750000     19
## 91       F  21       29 3.666667 2.250 3.916667     21
## 92       M  23       33 3.833333 3.000 2.333333     24
## 93       F  21       33 3.833333 4.000 2.750000     28
## 94       F  21       35 3.833333 3.500 2.750000     17
## 95       F  20       36 3.666667 2.625 2.916667     18
## 96       M  22       37 4.333333 2.500 2.083333     17
## 97       M  21       42 3.750000 3.750 3.666667     23
## 98       M  21       32 4.166667 3.625 2.833333     26
## 99       F  20       50 4.000000 4.125 3.416667     28
## 100      M  22       47 4.000000 4.375 1.583333     31
## 101      F  20       36 4.583333 2.625 2.916667     27
## 102      F  20       36 3.666667 4.000 3.000000     25
## 103      M  24       29 3.666667 2.750 2.916667     23
## 104      F  20       35 3.833333 2.750 2.666667     21
## 105      F  19       40 2.583333 1.375 3.000000     27
## 106      F  21       35 3.500000 2.250 2.750000     28
## 107      F  21       32 3.083333 3.625 3.083333     23
## 108      F  22       26 4.250000 3.750 2.500000     21
## 109      F  25       20 3.166667 4.000 2.333333     25
## 110      F  21       27 3.083333 3.125 3.000000     11
## 111      F  22       32 4.166667 3.250 3.000000     19
## 112      F  25       33 2.250000 2.125 4.000000     24
## 113      F  20       39 3.333333 2.875 3.250000     28
## 114      M  24       33 3.083333 1.500 3.500000     21
## 115      F  20       30 2.750000 2.500 3.500000     24
## 116      M  21       37 3.250000 3.250 3.833333     24
## 117      F  20       25 4.000000 3.625 2.916667     20
## 118      F  20       29 3.583333 3.875 2.166667     19
## 119      M  31       39 4.083333 3.875 1.666667     30
## 120      F  20       36 4.250000 2.375 2.083333     22
## 121      F  22       29 3.416667 3.000 2.833333     16
## 122      F  22       21 3.083333 3.375 3.416667     16
## 123      M  21       31 3.500000 2.750 3.333333     19
## 124      M  22       40 3.666667 4.500 2.583333     30
## 125      F  21       31 4.250000 2.625 2.833333     23
## 126      F  21       23 4.250000 2.750 3.333333     19
## 127      F  21       28 3.833333 3.250 3.000000     18
## 128      F  21       37 4.416667 4.125 2.583333     28
## 129      F  20       26 3.500000 3.375 2.416667     21
## 130      F  21       24 3.583333 2.750 3.583333     19
## 131      F  25       30 3.666667 4.125 2.083333     27
## 132      M  21       28 2.083333 3.250 4.333333     24
## 133      F  24       29 4.250000 2.875 2.666667     21
## 134      F  20       24 3.583333 2.875 3.000000     20
## 135      M  21       31 4.000000 2.375 2.666667     28
## 136      F  20       19 3.333333 3.875 2.166667     12
## 137      F  20       20 3.500000 2.125 2.666667     21
## 138      F  18       38 3.166667 4.000 2.250000     28
## 139      F  21       34 3.583333 3.250 2.666667     31
## 140      F  19       37 3.416667 2.625 3.333333     18
## 141      F  21       29 4.250000 2.750 3.500000     25
## 142      F  20       23 3.250000 4.000 2.750000     19
## 143      M  21       41 4.416667 3.000 2.000000     21
## 144      F  20       27 3.250000 3.375 2.833333     16
## 145      F  21       35 3.916667 3.875 3.500000      7
## 146      F  20       34 3.583333 3.250 2.500000     21
## 147      F  18       32 4.500000 3.375 3.166667     17
## 148      M  22       33 3.583333 4.125 3.083333     22
## 149      F  22       33 3.666667 3.500 2.916667     18
## 150      M  24       35 2.583333 2.000 3.166667     25
## 151      F  19       32 4.166667 3.625 2.500000     24
## 152      F  20       31 3.250000 3.375 3.833333     23
## 153      F  20       28 4.333333 2.125 2.250000     23
## 154      F  17       17 3.916667 4.625 3.416667     26
## 155      M  19       19 2.666667 2.500 3.750000     12
## 156      F  20       35 3.083333 2.875 3.000000     32
## 157      F  20       24 3.750000 2.750 2.583333     22
## 158      F  20       21 4.166667 4.000 3.333333     20
## 159      F  20       29 4.166667 2.375 2.833333     21
## 160      F  19       19 3.250000 3.875 3.000000     23
## 161      F  19       20 4.083333 3.375 2.833333     20
## 162      F  22       42 2.916667 1.750 3.166667     28
## 163      M  35       41 3.833333 3.000 2.750000     31
## 164      F  18       37 3.166667 2.625 3.416667     18
## 165      F  19       36 3.416667 2.625 3.000000     30
## 166      M  21       18 4.083333 3.375 2.666667     19
# the dataset has 166 rows and 7 columns
dim(learning2014)
## [1] 166   7
# there are 7 variables and 166 observations. One variable is a character string, while the others are integers or numeric values
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# a summary of the variables
summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :32.00   Median :3.667  
##                     Mean   :25.51   Mean   :31.43   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

2. Part

Here I present a graphical overview of the data and show summaries of the variables in the data.

# accessing the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# creating a plot matrix to give a graphical overview of the data
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# drawing the plot
p

# creating summaries of the variables
summary(factor(learning2014$gender))
##   F   M 
## 110  56
summary(learning2014$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   21.00   22.00   25.51   27.00   55.00
summary(learning2014$attitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   26.00   32.00   31.43   37.00   50.00
summary(learning2014$deep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   3.333   3.667   3.680   4.083   4.917
summary(learning2014$stra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.188   3.121   3.625   5.000
summary(learning2014$surf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   2.417   2.833   2.787   3.167   4.333
summary(learning2014$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

The data is divided in the plot matrix by gender, with females depicted in red and males in green. Looking at the distributions of the variables in the graphical overview, you can quickly notice that the age variable is strongly skewed to the right. This is expected, as the study population consists of students. We also notice a slight left skew in deep learning (deep) and in the points variable. The attitude variable is slightly left skewed in males but almost normally distributed in females. Additionally, surface learning (surf) is right skewed in males but again closer to a normal distribution in females.

We can further appreciate the distributions and characteristics of individual variables by looking at the scatter and box plots of the plot matrix. For example, you could inspect whether a certain variable has many outliers in the box plot, or whether the spread is larger or otherwise different in the scatter plot.

The graphical overview also shows correlation coefficients between variables. The most notable of these is the strong positive correlation between attitude and points. Also worth noting is the strong negative correlation between surface learning (surf) and both attitude and deep learning (deep) in male students, which is absent in female students. There is also a significant negative correlation between surface learning (surf) and strategic learning (stra) when the students are analyzed overall.

Looking at the summaries of the variables, you notice that the median age is quite young (median = 22 years) and that there are almost twice as many females as males (110 females to 56 males).

3. Part

Here I create a regression model where exam points is the target variable.

# accessing the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# creating a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# creating new model without non-significant explanatory variable surf
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

My initial model had attitude, strategic learning and surface learning as explanatory variables, as they seemed to have the largest correlation coefficients in the plot matrix. Looking at the coefficients, we notice positive estimates for attitude and strategic learning, while surface learning has a negative estimate. The t-statistic is the coefficient divided by its standard error. The larger the t-statistic is in absolute value, the more likely it is that the coefficient isn’t 0 and that there is a relationship between the variables. The t-value of attitude is clearly above 0, so we can reject the null hypothesis (no relationship between the target variable and the explanatory variable) and declare a relationship between points and attitude. This is supported by the small Pr(>|t|) of attitude, meaning there is a very small chance of seeing such a relationship between points and attitude due to chance alone. The same cannot be said about strategic and surface learning. While strategic learning does have a t-value above 0, the probability that the result is due to chance exceeds the cut-off point (p < 0.05), as stra has a Pr(>|t|) value of 0.11716. Surface learning is even further from significance, with a t-value of -0.731 and a Pr(>|t|) value of 0.46563.
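As a quick sanity check, the t-statistic can be reproduced by hand from the estimate and standard error of attitude reported in the summary above:

```r
# t statistic = estimate / standard error
est <- 0.33952   # attitude estimate from summary(my_model)
se  <- 0.05741   # its standard error
round(est / se, 2)  # about 5.91, matching the t value in the summary
```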

After the first model, I removed the surf variable from the model since it didn’t have a significant relationship with the target variable points. The new model was fitted with only attitude and stra as explanatory variables.

4. Part

Next I use the summary of my new fitted model and explain the relationship between the chosen explanatory variables and the target variable.

# accessing the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# my new model without the non-significant explanatory variable surf and its summary
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

In my new model we notice that both attitude and strategic learning still have positive estimates (0.34658 and 0.91365, respectively). This means that for every point the attitude variable increases, exam points increase by 0.34658 on average (and by 0.91365 in the case of stra). The t-statistic of attitude is clearly above 0 (t-value: 6.132), meaning we can confidently declare a relationship between points and attitude. This is supported by the small Pr(>|t|) of attitude (6.31e-09), meaning there is a very small chance of seeing such a relationship due to chance. Even though the t-statistic of stra is also clearly above 0 (t-value: 1.709) and its Pr(>|t|) is smaller than before (0.08927), it still does not reach the default significance cut-off of p < 0.05. Therefore, we cannot say that there is a significant relationship between points and stra. Rather, there is suggestive evidence of a potential relationship, but approximately an 8.9% probability that it is due to chance.
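To make the interpretation concrete, the fitted equation can be used to predict exam points for a hypothetical student (attitude = 30 and stra = 3 are made-up inputs, and the coefficients are those reported in the summary above):

```r
# Coefficients from summary(my_model2)
b0     <- 8.97290
b_att  <- 0.34658
b_stra <- 0.91365

# Predicted exam points for a hypothetical student
b0 + b_att * 30 + b_stra * 3  # about 22.1 points
```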

The R-squared statistic provides a measure of how well the model fits the data. A model with an R-squared value of 0 explains none of the variance in exam points, while a model with an R-squared value of 1 explains all of it. The square of the multiple correlation coefficient (multiple R-squared) of the model is 0.2048. As the multiple R-squared value increases with each additional explanatory variable, it is preferable to use the adjusted R-squared, which adjusts the value for the number of variables considered. The adjusted R-squared is 0.1951, i.e. these variables accounted for about 20% of the variation in exam points. This is marginally more than in the previous model. The residual standard error is a measure of the quality of the linear regression fit: it is the average amount by which the target variable points deviates from the true regression line. The smaller the residual standard error, the better the model fits the dataset. In this model the residual standard error is 5.289, which is slightly better than in the first model. The omnibus F-test had a very low associated p-value (7.734e-09), so there is very strong evidence that at least one of the two regression coefficients is non-zero.
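The R-squared definition can be verified numerically. This sketch uses R's built-in cars data rather than learning2014, so it illustrates the formula itself, not the model above:

```r
# R-squared = 1 - RSS/TSS, checked against the value summary() reports
fit <- lm(dist ~ speed, data = cars)
rss <- sum(residuals(fit)^2)                 # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)        # TRUE
```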

5. Part

Finally I produce the following diagnostic plots of my model: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

# divides the window into a 2x2 grid
par(mfrow = c(2,2))

# draws the diagnostic plots using the plot() function. The previously specified plots 1, 2 and 5 are selected
plot(my_model2, c(1,2,5))

Linear regression has four main assumptions: linear relationship, independence, homoscedasticity and normality.

We can use the residuals vs. fitted plot to evaluate whether there is correlation between residuals and fitted values. As we can see, the data points have a fairly random distribution around the 0 line, which supports the assumptions that the relationship is linear and the data is reasonably homoscedastic. There are a few outliers with large negative residuals, but these don’t appear to alter the overall variance in the data.

Using the normal Q-Q plot, we can evaluate the skewness of the residuals and the fit of the model. The residuals of my model seem to be slightly left skewed, but fairly normally distributed overall.

Finally, we can use the residuals vs. leverage plot to assess linearity and influential points. Influential points are observations that, when removed from the dataset, could change the coefficients of the model when it is refitted. Here we can see that the spread is nicely linear. No influential points are observed in the spread.
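Influence can also be checked numerically with Cook's distance. This sketch again uses the built-in cars data as a stand-in for my_model2:

```r
# Cook's distance measures how much each observation shifts the fit
fit <- lm(dist ~ speed, data = cars)
cd  <- cooks.distance(fit)

# A common rule of thumb flags observations with distance > 4/n
which(cd > 4 / nrow(cars))
```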



Assignment 3 - Logistic regression

date()
## [1] "Mon Dec 12 19:16:18 2022"

Reading joined student alcohol consumption data

The data we are using in this assignment was acquired from the UCI Machine Learning Repository, Student Performance Data (incl. Alcohol consumption) page and later modified.

#install required packages if needed
#install.packages("tidyverse")
#install.packages("boot")
#install.packages("tidyr")
#install.packages("readr")
#install.packages("dplyr")
#install.packages("patchwork")

# access the readr and dplyr library
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#read the 'alc' data set into memory
alc <- read_csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/alc.csv", show_col_types=FALSE)

# look at the column names/names of the variables of the data set
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data was collected from secondary-education students at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. As we can see, there are 35 variables in our data set. Our focus in this assignment is to evaluate the possible relationship between high/low alcohol use (high_use) and the other variables.

Choosing variables in relation to high/low alcohol consumption

I chose the student’s sex (sex), going out with friends (goout), number of past class failures (failures) and number of school absences (absences) as my variables of interest. My hypothesis is that male students have on average higher alcohol consumption than female students, as males also have higher alcohol consumption in the general population. I also hypothesize that students with more absences and class failures have higher alcohol consumption on average, as people with excessive alcohol use might miss school or work due to hangovers etc. Additionally, I hypothesize that going out with friends more often increases the risk of high alcohol consumption, as alcohol tends to be a frequent component when young adults socialize.

Exploring the chosen variables

# access tidyr library
library(tidyr)

# produce summary statistics by group
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_going_out = mean(goout), mean_failures = mean(failures), mean_absences = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 6
## # Groups:   sex [2]
##   sex   high_use count mean_going_out mean_failures mean_absences
##   <chr> <lgl>    <int>          <dbl>         <dbl>         <dbl>
## 1 F     FALSE      154           2.95         0.104          4.25
## 2 F     TRUE        41           3.39         0.293          6.85
## 3 M     FALSE      105           2.70         0.143          2.91
## 4 M     TRUE        70           3.93         0.386          6.1

Looking at our summarised table, we can see that considerably more males than females have high alcohol consumption (70 and 41, respectively), even though there are almost the same number of males and females in total (175 and 195, respectively). We can also note that students with high alcohol consumption go out with their friends more often than students without high alcohol consumption, regardless of sex. Additionally, the average numbers of class failures and school absences are higher for students with high alcohol consumption than for others, regardless of sex.

Next we will take a closer look at the distributions of failures and goout.

# access ggplot2 and patchwork libraries
library(ggplot2)
library(patchwork)
theme_set(theme_classic())

# closer look at the distribution of failures and goout
alc %>% count(failures)
## # A tibble: 4 × 2
##   failures     n
##      <dbl> <int>
## 1        0   325
## 2        1    24
## 3        2    17
## 4        3     4
alc %>% count(goout)
## # A tibble: 5 × 2
##   goout     n
##   <dbl> <int>
## 1     1    22
## 2     2    97
## 3     3   120
## 4     4    78
## 5     5    53
# create histograms for failures and goout
h1 <- alc %>% ggplot(aes(x = failures)) + geom_histogram(binwidth = 1) + scale_fill_grey()
h2 <- alc %>% ggplot(aes(x = goout)) + geom_histogram(binwidth = 1) + scale_fill_grey()

# draw histograms of failures and goout
h1 + h2

As we can see from the tables and histograms, class failures are heavily right skewed, as the vast majority of students haven’t failed any class and only a handful have failed at least one (325 and 45, respectively). Going out with friends has an almost normal distribution, although only a few students go out with their friends very rarely.

Next we will look at the chosen variables together with their relationship with alcohol consumption.

# define the plot as a bar plot
g1 <- alc %>% ggplot(aes(x = sex, fill = high_use)) + geom_bar(position = "fill") + ylab("proportion") + theme(legend.position = "none") + scale_fill_grey()

# define the plot as a bar plot
g2 <- alc %>% ggplot(aes(x = failures, fill = high_use)) + geom_bar(position = "fill") + ylab("proportion") + theme(legend.position = "none") + scale_fill_grey()

# define the plot as a box plot
g3 <- alc %>% ggplot(aes(x = high_use, y = absences)) + geom_boxplot() + xlab("high alcohol consumption")

# define the plot as a bar plot
g4 <- alc %>% ggplot(aes(x = goout, fill = high_use)) + geom_bar(position = "fill") + ylab("proportion") + xlab("going out with friends") + scale_fill_grey(name = "high alcohol consumption")

# draw the plots
g1 + g2 + g3 + g4

Looking first at the sex variable, we notice that a higher proportion of males have high alcohol consumption compared to females, as we also noted previously.

Next looking at the failures variable, we notice that the proportion of students with high alcohol consumption increases with the number of failed classes, which suggests a positive correlation between them.

Next, looking at the absences variable, we notice that the median number of absences and the spread seem to be larger for students with high alcohol consumption. This might also point to a positive correlation between them.

Finally, looking at the goout variable, we notice that the proportion of students with high alcohol consumption increases with how often they go out with friends, similarly to the failures variable. This also suggests a positive correlation between them.

The findings in these plots seem to support my initial hypothesis regarding these chosen variables.

Creating a logistic regression model using the selected variables

Next I created a logistic regression model to predict high/low alcohol consumption. The target variable is high_use and the explanatory variables are failures, absences, sex and goout. Here I provide a summary of my model and interpret its coefficients as odds ratios.

# create logistic regression model using selected variables
m <- glm(high_use ~ failures + absences + sex + goout, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1521  -0.7929  -0.5317   0.7412   2.3706  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.14389    0.47881  -8.654  < 2e-16 ***
## failures     0.48932    0.23073   2.121 0.033938 *  
## absences     0.08246    0.02266   3.639 0.000274 ***
## sexM         0.97989    0.26156   3.746 0.000179 ***
## goout        0.69801    0.12093   5.772 7.83e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.50  on 365  degrees of freedom
## AIC: 379.5
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR) for the coefficients of the model by exponentiation
OR <- coef(m) %>% exp

# compute confidence intervals (CI) for the odds ratios by exponentiation
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR       2.5 %     97.5 %
## (Intercept) 0.01586098 0.005950831 0.03905075
## failures    1.63121360 1.042981792 2.59042892
## absences    1.08595465 1.039999394 1.13814475
## sexM        2.66415000 1.605059602 4.48576247
## goout       2.00975350 1.594527456 2.56461657

All the explanatory variables have positive regression coefficients, which means that each of them increases the probability of high alcohol consumption. The z-value is the regression coefficient divided by its standard error; if it is clearly above or below 0, the variable is likely to be significant. The usual cut-off for significance is |z| > 2.0 (corresponding to p < 0.05), which all my chosen variables meet. We can see that failures has a significant relationship with high_use, and the other chosen variables have very strongly significant relationships with high_use.

Looking at the odds ratios of my explanatory variables, we see that there is a strong positive association between the variables goout and sex and high_use, as both the OR and CI are clearly above 1.0. The variables failures and absences also have a positive association with high_use, although the increase in the probability of high alcohol consumption they imply is not as large as for goout and sex.
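To make the odds-ratio interpretation concrete, here is a small arithmetic sketch (the coefficient value 0.698 is just an illustrative number close to the goout estimate above, and the starting odds of 0.5 are invented for the example):

```r
# Illustrative only: how a logistic regression coefficient maps to an odds ratio.
b <- 0.698                 # hypothetical coefficient, similar to goout above
OR <- exp(b)               # odds ratio per one-unit increase, ~2.01
# Each one-unit increase multiplies the odds by OR; e.g. starting from odds of 0.5:
odds0 <- 0.5
odds1 <- odds0 * OR
# Converting odds back to probabilities (p = odds / (1 + odds)):
p0 <- odds0 / (1 + odds0)  # ~0.33
p1 <- odds1 / (1 + odds1)  # ~0.50
c(OR = OR, p0 = p0, p1 = p1)
```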

Next we explore the predictive power of my model.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   243   16
##    TRUE     61   50
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65675676 0.04324324 0.70000000
##    TRUE  0.16486486 0.13513514 0.30000000
##    Sum   0.82162162 0.17837838 1.00000000
# computing the total proportion of inaccurately classified individuals ( = training error)
(16+61)/370
## [1] 0.2081081

From the cross tabulation we can see that our model is fairly good at predicting low consumption of alcohol, but not as good when alcohol consumption is high (243/259 and 50/111, respectively). This is made clearer by looking at the point plot.

The training error of my model was about 0.21.

If a person simply guessed at random whether a student’s alcohol consumption is high or low, the training error of this guessing would approach 0.50 (flipping a coin). This would mean my model classifies roughly 30 percentage points more of the students correctly than simple guessing.
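The same comparison can be made against a stricter baseline than coin flipping: always predicting the majority class (low consumption). A quick sketch using the class counts from the cross tabulation above:

```r
# Baseline error rates to compare against the model's training error of ~0.21.
n_high <- 111; n_low <- 259          # class counts from the table above
n <- n_high + n_low                  # 370 students
guess_error <- 0.5                   # expected error of random coin-flip guessing
majority_error <- n_high / n         # always predict "low": wrong for every high-use student
c(guess = guess_error, majority = majority_error)   # majority baseline is 0.30
```

Since the model's training error (about 0.21) is below even the 0.30 majority-class baseline, the explanatory variables genuinely add predictive value.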

Bonus: Performing 10-fold cross validation of my model

As an additional task we were asked to perform a 10-fold cross validation of our models.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2108108

As we can see, my model has a prediction error of about 0.21, which is better than the model introduced in Exercise Set 3 (prediction error ≈ 0.26).
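For reference, the same cross-validation recipe can be reproduced end to end on a built-in data set; this self-contained sketch uses mtcars and its binary am variable (not the course data), so the resulting error is purely illustrative:

```r
library(boot)

# same loss function as above: mean prediction error
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}

# a small logistic model on built-in data (binary target: transmission type)
fit <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

# 10-fold cross-validation of its prediction error
set.seed(1)
cv_demo <- cv.glm(data = mtcars, cost = loss_func, glmfit = fit, K = 10)
cv_demo$delta[1]   # proportion of misclassified observations
```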



Assignment 4 - Clustering and classification

date()
## [1] "Mon Dec 12 19:16:20 2022"
# install.packages(c("MASS", "corrplot", "tidyverse", "reshape2")) if needed

Exploring our data

In this assignment we look at the Boston data set from the MASS library. It describes housing values in suburbs of Boston along with various other factors relating to the housing and its location. The data frame has 14 variables and 506 observations (14 columns and 506 rows). Below I show the structure of the data frame and a short summary of the variables.

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
# load the Boston data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Graphical overview of data

Here I visualize the distribution of the variables and the relationships between them.

# access the required libraries
library(tidyr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
# convert data from wide to long
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
# visualize variables with small multiple chart
g1 <- ggplot(data = melt.boston, aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")
g1

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(2)

# test the significance of the correlations (cor.mtest expects the data, not the correlation matrix)
testRes = cor.mtest(Boston, conf.level = 0.95)

# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.col = "black", lower.col = "black", number.cex = 0.7, tl.cex=0.7)

Looking at the previous summary of the variables and the visualization, we can get a good understanding of the data. We can see that crime rate crim, proportion of residential land zoned zn, distance to employment centers dis and lower status of the population lstat are all clearly right-skewed. On the other hand, proportion of blacks black, proportion of units built before 1940 age and pupil-teacher ratio ptratio are all clearly left-skewed. Proportion of non-retail business acres indus, accessibility to highways rad and property-tax rate tax have bimodal distributions (the Charles River dummy variable chas is binary). Looking at the correlation matrix, we can see that the highest positive correlations are between tax and rad, indus and nox (nitrogen oxide concentration), age and nox, and rm (average number of rooms) and medv (median value). The highest negative correlations are between dis and nox, dis and age, medv and lstat, and dis and indus.

Standardizing the data

In order to standardize our data we have to scale the data.

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# print summary of scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling the data we notice that the mean of every variable is 0. scale() centers each variable by subtracting its column mean and then divides by its column standard deviation, so all variables end up on a comparable scale (mean 0, standard deviation 1) regardless of their original units.
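As a quick sanity check of what scale() does, the same transformation can be written out by hand for a toy vector:

```r
# scale() subtracts the column mean and divides by the column standard deviation
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)     # TRUE
c(mean = mean(manual), sd = sd(manual))     # mean 0, sd 1
```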

Next we create a categorical variable of the crime rate in the scaled data set. After this, we divide the data set to train and test sets (80% of the data belongs to the train set).

#access the dplyr library
library(dplyr)

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Fitting the linear discriminant analysis

Here we fit a linear discriminant analysis (LDA) on the train set, using the categorical crime rate crime as the target variable and all the other variables in the data set as predictor variables. We then visualize this analysis with a biplot.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

Predicting classes using the LDA model

Next we first save the crime categories from the test set and then remove the categorical crime variable from the test data set. After this, we predict the classes with the LDA model on the test data. Finally we cross tabulate the results with the crime categories from the test set.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15      13        2    0
##   med_low    3      11        5    0
##   med_high   1      11       12    1
##   high       0       0        1   27

Looking at the cross tabulation we can see that the overall prediction rate of our model is approximately 64% (65/102; correct predictions divided by all predictions), which is fairly good. We notice that our model is really good at predicting the high crime rate category, while not as good at predicting the other categories.
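The overall prediction rate can be recomputed directly from the cross tabulation above (the counts are copied from that table; a different random train/test split would give slightly different numbers):

```r
# confusion matrix copied from the cross tabulation above (rows = correct class)
conf <- matrix(c(15, 13,  2,  0,
                  3, 11,  5,  0,
                  1, 11, 12,  1,
                  0,  0,  1, 27),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# correct predictions lie on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
accuracy   # 65/102, about 0.64
```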

Measuring distance and clustering

Next we reload the Boston data set and standardize it. Then we calculate the distance between observations.

library(MASS)
data("Boston")

# center and standardize variables as done previously
boston_scaled2 <- scale(Boston)

# change the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled2)

# create the euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Next we run the k-means algorithm on the data set and after finding the optimal number of clusters we visualize them.

# Use seed to mitigate the effect of random number generators
set.seed(1234)

# k-means clustering (on the scaled data, so that no single variable dominates the distances)
km <- kmeans(boston_scaled2, centers = 3)

# plot the scaled data set with clusters
pairs(boston_scaled2, col = km$cluster)

# determine the number of clusters for optimization
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

# k-means clustering with optimal number of clusters
km2 <- kmeans(boston_scaled2, centers = 2)

# plot the scaled dataset with clusters
pairs(boston_scaled2, col = km2$cluster)

The k-means algorithm was initially run with three clusters. Visualizing how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes, we can see that the total WCSS drops radically at two clusters, meaning this is our optimal number of clusters. We then performed the k-means algorithm again with the optimal number of clusters and visualized it. Looking at our plot with two clusters we can see that, overall, the clusters are fairly well separated within the variables, meaning the clustering seems to work properly. Comparing the final plot to the original k-means run with 3 clusters, we can see that the latter has more overlap between clusters and is therefore harder to interpret.
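The elbow logic generalizes beyond this data set. Here is a self-contained sketch on the built-in iris data (numeric columns only) showing that the total WCSS always decreases with k, so the informative signal is where the drop levels off:

```r
# elbow sketch on built-in data: total WCSS vs. number of clusters
d <- scale(iris[, 1:4])
set.seed(1234)
twcss_demo <- sapply(1:6, function(k) kmeans(d, centers = k, nstart = 10)$tot.withinss)
# WCSS is monotonically decreasing; look for where successive drops flatten out
round(diff(twcss_demo), 1)
```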



Assignment 5 - Dimensionality reduction techniques

date()
## [1] "Mon Dec 12 19:16:26 2022"

Exploring the data

Here we look at the variables of the human data set, which we created previously.

#read the 'human' data set into memory
human <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/human.csv", row.names = 1)

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
#access GGally library
library(GGally)

#visualize the 'human' variables
ggpairs(human, progress = FALSE)

#access corrplot library
library(corrplot)

#compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()

As we look at the summaries of the variables and the pairs plot, we can see that Gross National Income per capita (GNI), maternal mortality ratio (Mat.Mor) and adolescent birth rate (Ado.Birth) all have heavily right-skewed distributions. On the other hand, both life expectancy (Life.Exp) and the ratio of female to male labor force participation (Labo.FM) have left-skewed distributions. The distribution of the ratio of female to male populations with secondary education (Edu2.FM) is strongly peaked (positive kurtosis).

Looking at the pairs plot and correlation plot we notice that there are a few significant correlations between our variables. There is a high positive correlation between Ado.Birth and Mat.Mor (0.759). There is also a high negative correlation between Life.Exp and both Mat.Mor (-0.857) and Ado.Birth (-0.729). On the other hand, there is a high positive correlation between Life.Exp and expected years of schooling (Edu.Exp) (0.789). Lastly, there is a positive correlation between Gross National Income per capita (GNI) and both Life.Exp (0.627) and Edu.Exp (0.624).

Principal component analysis

Next we perform the principal component analysis (PCA) on the raw (non-standardized) human data.

#perform principal component analysis (with the singular value decomposition method)
pca_human <- prcomp(human)

#create a summary of pca_human
s <- summary(pca_human)

#rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 2)

#print out the percentages of variance
pca_pr
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00
#create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

#draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("navyblue", "firebrick1"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
*PCA with raw data: PC1 explains almost all of the variance. GNI has a heavy negative loading on PC1, while the other variables do not seem to have a significant loading on either PC.*


Next we standardize the variables and repeat the above analysis.

#standardize the variables
human_std <- scale(human)

#perform principal component analysis on the standardized variables (with the SVD method)
pca_human_std <- prcomp(human_std)

#create and print out a summary of pca_human_std
s2 <- summary(pca_human_std)

#rounded percentanges of variance captured by each PC
pca_pr2 <- round(100*s2$importance[2, ], digits = 2)

#print out the percentages of variance
pca_pr2
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 53.61 16.24  9.57  7.58  5.48  3.60  2.63  1.30
# create object pc_lab to be used as axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")

#draw a biplot of the principal component representation and the standardized variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("navyblue", "firebrick1"), xlab = pc_lab2[1], ylab = pc_lab2[2])
*PCA with standardized data: PC1 explains 53.61% of the variance, while PC2 explains 16.24%. Mat.Mor and Ado.Birth have both positive loading on PC1, while Edu2.FM, GNI, Edu.Exp and Life.Exp have negative loading on PC1. Parli.F and Labo.FM have both a positive loading on PC2. No variables seem to have a significant negative loading on PC2.*


As we can see, the results of the two PCAs are very different. In the first PCA, using the non-standardized data, the first principal component (PC1) explains almost all of the variance (99.99%) and the second principal component (PC2) explains only 0.01%. After standardization, PC1 explains 53.61% and PC2 explains 16.24% of the variance. I believe this difference is due to the different scales of the raw variables. Looking at the previous section, we can see that GNI has a much larger scale than the other variables. This is evident in the first biplot, where the only visible arrow is GNI. This shows that the first PCA has loaded heavily on the GNI variable. Once the variables are standardized, the scales are very similar between variables and the PCA is no longer influenced by differing variable scales.
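The scale effect can be demonstrated with a two-variable toy example: give one simulated variable a GNI-like large standard deviation and unscaled PCA assigns essentially all of the variance to it (the variable names small and big are invented for this sketch):

```r
# toy demonstration: unscaled PCA is dominated by the highest-variance variable
set.seed(1)
toy <- data.frame(small = rnorm(100, sd = 1),
                  big   = rnorm(100, sd = 1000))   # mimics GNI's large scale
round(summary(prcomp(toy))$importance[2, ], 4)                 # PC1 ~ 100%
round(summary(prcomp(toy, scale. = TRUE))$importance[2, ], 4)  # roughly even split
```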

My own interpretation of the first two principal component dimensions is that PC1 relates to how developed a country is. The reasoning behind this comes from the variables related to PC1: Mat.Mor, Ado.Birth, Edu2.FM, GNI, Edu.Exp and Life.Exp. All of these are key factors when assessing the development stage of a country. This is further supported by the biplot, where richer and more developed countries (e.g. many European nations) are on the left side while many poorer countries (e.g. many African nations) are on the right side. PC2 seems to be related to female participation in society, as the variables related to PC2 were Parli.F and Labo.FM. We can also notice that the countries furthest away from these variables are fairly male-centric (e.g. Arab nations).

Multiple Correspondence Analysis

Next we will look at the tea data. The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what their perception of the product is (12 questions). In addition, some personal details were recorded (4 questions). We will look at the dimensions and structure of the data and finally visualize it.

#read tea data set into memory and convert character variables to factors
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

#look at the dimension and structure of the data
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#browse the contents of the data frame
View(tea)

#access dplyr library
library(dplyr)

#selecting columns to keep in the data set for visualization and MCA
keep_columns <- c("Tea", "How", "sugar", "how", "sex", "age_Q", "frequency")

#select the 'keep_columns' to create a new data set (all_of() avoids the tidyselect deprecation warning)
tea_time <- dplyr::select(tea, all_of(keep_columns))
#access the ggplot and tidyr library
library(ggplot2)
library(tidyr)

#visualize the data set
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) + facet_wrap("name", scales = "free")

Looking at the structure of the tea data frame, we see that there are 36 variables and 300 observations. Almost all of the variables are factors, except age, which is an integer. For the visualization and the upcoming analysis I chose seven variables that seemed interesting to me. From the visualization we can see that Earl Grey is the most popular tea variety. Most people drink their tea alone (without any additions), drink tea at least once per day and use tea bags. The use of sugar is fairly evenly split. There were slightly more females than males among the respondents, and the largest age group was 15 to 24.

Next we do the multiple correspondence analysis (MCA) for the selected variables.

#multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)

#summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.246   0.216   0.186   0.179   0.168   0.159   0.149
## % of var.             10.752   9.435   8.137   7.841   7.347   6.947   6.534
## Cumulative % of var.  10.752  20.187  28.324  36.165  43.512  50.459  56.993
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.145   0.130   0.123   0.122   0.113   0.101   0.095
## % of var.              6.358   5.701   5.396   5.358   4.933   4.419   4.171
## Cumulative % of var.  63.351  69.052  74.448  79.807  84.740  89.158  93.329
##                       Dim.15  Dim.16
## Variance               0.078   0.074
## % of var.              3.430   3.241
## Cumulative % of var.  96.759 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
## 1         |  0.015  0.000  0.000 |  0.558  0.481  0.140 |  0.396  0.282  0.071
## 2         |  0.611  0.507  0.171 |  0.049  0.004  0.001 |  0.020  0.001  0.000
## 3         |  0.366  0.181  0.107 | -0.559  0.483  0.250 | -0.331  0.197  0.088
## 4         | -0.752  0.768  0.450 |  0.019  0.001  0.000 |  0.147  0.039  0.017
## 5         |  0.241  0.079  0.043 | -0.109  0.018  0.009 | -0.326  0.190  0.078
## 6         | -0.376  0.192  0.114 | -0.123  0.023  0.012 | -0.073  0.009  0.004
## 7         |  0.096  0.012  0.003 |  0.144  0.032  0.008 |  0.082  0.012  0.003
## 8         |  0.449  0.273  0.065 |  0.155  0.037  0.008 |  0.664  0.791  0.143
## 9         |  0.257  0.089  0.028 |  0.014  0.000  0.000 |  0.050  0.004  0.001
## 10        |  0.675  0.619  0.199 |  0.096  0.014  0.004 |  0.001  0.000  0.000
##            
## 1         |
## 2         |
## 3         |
## 4         |
## 5         |
## 6         |
## 7         |
## 8         |
## 9         |
## 10        |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test  
## black     |   1.056  15.999   0.365  10.452 |   0.459   3.448   0.069   4.545 |
## Earl Grey |  -0.457   7.812   0.377 -10.614 |  -0.293   3.658   0.155  -6.804 |
## green     |   0.304   0.592   0.011   1.850 |   0.684   3.405   0.058   4.155 |
## alone     |  -0.035   0.047   0.002  -0.831 |  -0.180   1.389   0.060  -4.232 |
## lemon     |  -0.330   0.698   0.013  -2.009 |   0.487   1.725   0.029   2.958 |
## milk      |   0.025   0.008   0.000   0.227 |   0.306   1.306   0.025   2.731 |
## other     |   1.798   5.635   0.100   5.466 |  -0.037   0.003   0.000  -0.114 |
## No.sugar  |   0.631  11.946   0.425  11.275 |  -0.223   1.704   0.053  -3.990 |
## sugar     |  -0.674  12.770   0.425 -11.275 |   0.239   1.822   0.053   3.990 |
## tea bag   |  -0.192   1.218   0.048  -3.802 |  -0.036   0.048   0.002  -0.708 |
##             Dim.3     ctr    cos2  v.test  
## black       0.537   5.456   0.094   5.310 |
## Earl Grey  -0.010   0.005   0.000  -0.225 |
## green      -1.147  11.110   0.163  -6.971 |
## alone      -0.318   5.037   0.187  -7.485 |
## lemon       0.412   1.434   0.021   2.505 |
## milk        0.375   2.267   0.037   3.342 |
## other       2.747  17.383   0.233   8.352 |
## No.sugar   -0.320   4.076   0.110  -5.730 |
## sugar       0.343   4.357   0.110   5.730 |
## tea bag     0.334   4.859   0.146   6.607 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## Tea       | 0.420 0.159 0.216 |
## How       | 0.110 0.067 0.340 |
## sugar     | 0.425 0.053 0.110 |
## how       | 0.084 0.171 0.150 |
## sex       | 0.045 0.517 0.000 |
## age_Q     | 0.478 0.403 0.421 |
## frequency | 0.159 0.140 0.065 |
#visualize MCA with the variable biplot
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

Looking at the variable biplot, we can see that the first two dimensions of the MCA capture 10.75% and 9.44% of the total variance. The horizontal dimension opposes the “other” tea drinkers to the remaining options. The vertical dimension opposes males to females and unpackaged tea to the other packaging options; it also seems to oppose Earl Grey to the other tea varieties. The age quartiles are linked to both dimensions. Looking at how the variables are spread, we can tentatively infer that older men are more likely than young women to drink unpackaged tea of the green and black varieties.



Assignment 6 - Analysis of longitudinal data

date()
## [1] "Mon Dec 12 19:16:36 2022"

Graphical Displays and Summary Measure approach

Graphical displays of the longitudinal data

First we look at the long-form RATS (RATSL) data. It is derived from a nutrition study conducted on three groups of rats. The groups were put on different diets, and each animal’s body weight (in grams) was recorded approximately weekly (twice in week seven) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.
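RATSL was created in the data-wrangling part of the assignment; as a rough sketch of that wide-to-long conversion (with made-up toy values and only two of the eleven WD columns, so not the real data), it could look like this:

```r
#access the tidyr and dplyr libraries
library(tidyr)
library(dplyr)

#toy wide-format data mimicking the RATS structure (values are made up)
RATS_wide <- data.frame(ID = 1:2, Group = c(1, 2),
                        WD1 = c(240, 410), WD8 = c(250, 415))

#pivot to long form: one row per rat per weighing day,
#and extract the numeric day from the WD column name
RATS_long <- RATS_wide %>%
  pivot_longer(cols = starts_with("WD"),
               names_to = "WD", values_to = "Weight") %>%
  mutate(Time = as.integer(substring(WD, 3)))
```

This produces the same ID, Group, WD, Weight and Time columns that the real RATSL data contains.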

RATSL <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/RATSL.csv", sep = ",", header = TRUE)

#Convert ID and Group variables into factors
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

#Check that variables and structure is correct
str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
summary(RATSL)
##        ID      Group       WD                Weight           Time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110

As we can see, there are 176 observations of 5 different variables. ID is the identification number of the test subjects (16 in total). Group indicates which of the three diets a test subject was on. The time of weighing is presented both as the character variable WD and as the integer variable Time (11 time points in total). Finally, the Weight variable gives the weight of the test subject on the given day.

Next we will look at the RATSL data using a graphical display. The plot will display individual response profiles by diet group for each test subject.

#Access the package ggplot2
library(ggplot2)

#Set theme for plots in this chapter
theme_set(theme_classic())

#Draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

As we can see from the plot, the weight of almost all the rats increases during the 9-week period. This weight gain seems to be more pronounced in groups 2 and 3 than in group 1. Additionally, the rats that had lower (or higher) weights at the beginning usually also had lower (or higher) weights at the end of the study. This phenomenon is generally referred to as tracking. There don’t seem to be many individual differences between test subjects within the diet groups.

Next we will look more closely at the individual responses using standardized values, which should give us a better look at the tracking phenomenon.

#Access the dplyr and tidyr packages
library(dplyr)
library(tidyr)

#Standardize the variable Weight
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdWeight = (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()

#Plot again with the standardised Weight
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(name = "standardized Weight")

We can see some change in individual weights, but in general the heavier test subjects stayed heavier throughout the testing period, and likewise the lighter subjects stayed lighter.

As graphs with individual responses can be quite hard to interpret when using large numbers of observations, we will next create a summary graph using mean response profiles to better illustrate the difference between diet groups.

#Summary data with mean and standard error of Weight by Group and Time 
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = (sd(Weight)/sqrt(length(Weight)))) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
#Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.4)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

We can see that there is no overlap between the diet groups, and their mean response profiles run fairly parallel throughout the plot. As previously suggested, group 1 seems to gain less mean weight than groups 2 and 3. Because the profiles stay roughly parallel, however, the groups may differ mainly in their overall weight levels rather than in their rates of weight gain.

Summary measure analysis of longitudinal data

Next we will use the summary measure approach to evaluate the response of each diet group over time. We will use the mean weight of days 8 to 64 as the chosen summary measurement.

#Create a summary data by Group and ID with mean as the summary variable (ignoring baseline Time 1)
RATSL10S <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
#Draw a boxplot of the mean versus Group
ggplot(RATSL10S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), Time 8-64")

#Remove the one clear outlier
RATSL10S1 <- RATSL10S %>% filter(mean < 550)

#Redraw the previous plot without the outlier
ggplot(RATSL10S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), Time 8-64")

Looking at the first box plot, we notice the mean summary measure is more variable in group 2 than in the other diet groups. The distribution of group 3 is slightly skewed. Group 2 is also clearly skewed and has an outlier, a test subject with a mean weight of almost 600 grams. Because it might bias the conclusions from further comparisons of the groups, we decide to remove it. Note that there are also outliers in groups 1 and 3, but as they are not as pronounced as the one in group 2, we decide to keep them.

Without the outlier the boxplot is a bit easier to interpret: the variance of group 2 is now very small, and the mean weight of group 2 is more clearly lower than that of group 3.
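Here the outlier was removed with a simple ad-hoc cut-off (mean < 550); a more systematic alternative would be the 1.5 × IQR rule that boxplots themselves use to flag outliers. A small sketch with made-up toy values:

```r
#toy summary means with one clearly deviant value (made-up numbers)
means <- c(265, 270, 275, 280, 590)

#flag values more than 1.5 * IQR outside the quartiles
q <- quantile(means, c(0.25, 0.75))
iqr <- q[2] - q[1]
outliers <- means[means < q[1] - 1.5 * iqr | means > q[2] + 1.5 * iqr]
outliers
## [1] 590
```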

The plots suggest that there is a significant difference in mean weight between the groups, which we can confirm with formal statistical tests. We will also use the pre-diet weights as a covariate in an analysis of covariance to see whether the diets have an effect on weight independent of the baseline values.

#Create model to use in ANOVA
one.way <- lm(mean ~ Group, RATSL10S1)

#Compute the ANOVA table for the model
anova(one.way)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 207659  103830  501.81 2.721e-12 ***
## Residuals 12   2483     207                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Perform a two-sample t-test for Groups 1 and 2
t.test(mean ~ Group, data = RATSL10S1, subset = Group %in% c(1,2), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -25.399, df = 9, p-value = 1.094e-09
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -204.0633 -170.6867
## sample estimates:
## mean in group 1 mean in group 2 
##         265.025         452.400
#Perform a two-sample t-test for Groups 2 and 3
t.test(mean ~ Group, data = RATSL10S1, subset = Group %in% c(2,3), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -5.632, df = 5, p-value = 0.002446
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -109.37769  -40.82231
## sample estimates:
## mean in group 2 mean in group 3 
##           452.4           527.5
#Perform a two-sample t-test for Groups 1 and 3
t.test(mean ~ Group, data = RATSL10S1, subset = Group %in% c(1,3), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3 
##         265.025         527.500
#Read the original RATS data into memory for the next part
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')

#Add the baseline from the original data as a new variable to the summary data
RATSL10S2 <- RATSL10S %>%
  mutate(baseline = RATS$WD1)

#Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSL10S2)

#Compute the ANOVA table for the fitted model
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that there is a significant difference between the mean weights of the diet groups. Examining the groups in more detail with pairwise t-tests, we see that the difference in mean weight is significant between all pairs of diet groups. This is expected, as the difference in weights between the groups was clearly visible from the start.

Looking at the ANOVA table with the pre-diet weights as a covariate, we can see that the baseline values are strongly related to the weights recorded during the diets. Although there is slight evidence of a diet effect after conditioning on the baseline value (p = 0.076), the difference is nevertheless not significant.

Linear Mixed Effect Models for Normal Response Variables

Plotting the data

In the second part of our assignment we look at the long-form BPRS data (BPRSL). It is derived from the BPRS data, in which 40 male subjects were randomly assigned to one of two treatment groups, and each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity, each rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.

BPRSL <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/BPRSL.csv", sep = ",", header = TRUE)

#Convert treatment and subject variables into factors
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)

#Check that variables and structure is correct
str(BPRSL)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...

Looking at the output, we see there are 360 observations of 5 different variables. The variable ‘treatment’ indicates which of the two treatments a subject received. Subject is the identification number of the subjects (40 in total). The time points of the BPRS ratings are presented both as the character variable ‘weeks’ and as the integer variable ‘week’ (9 time points in total). Finally, the bprs variable gives the BPRS rating of the subject in the given week.

Next we plot the BPRSL data to visually evaluate if there is a difference between treatment groups.

#Give individual subjects unique identification numbers
BPRSL$subject <- as.numeric(BPRSL$subject)
BPRSL <- BPRSL %>% mutate(subject = case_when(treatment == 2 ~ subject + 20, TRUE ~ as.numeric(subject)))

#Convert subject variable back into factors
BPRSL$subject <- factor(BPRSL$subject)

#Plot the BPRSL data
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line() + aes(linetype = treatment) + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 10, 1)) + scale_y_continuous(name = "BPRS") + theme(legend.position = "top")

Initially it is hard to see a clear difference between the two treatment groups, although at the end of the 8-week period the subjects in treatment 2 seem to have more of the higher BPRS ratings than the subjects in treatment 1.

Creating models for our data

Ignoring the repeated-measures structure of our data, we then create a regression model for the BPRSL data.

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Looking at the results of our model, we can see that the regression on the week variable is highly significant, whereas the treatment groups do not differ significantly from each other. However, this form of analysis assumes the repeated measurements are independent, which is not appropriate for our data, so we move on to consider more suitable models.

Next we create the random intercept model as a more suitable model for the BPRSL data. We use ‘week’ and ‘treatment’ as explanatory variables.

#Access library lme4
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
#Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

#Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## week         -2.2704     0.1503 -15.104
## treatment2    0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.256       
## treatment2 -0.684  0.000

The estimated regression parameters for week and the treatment variable are quite similar to the results from the previous independence model.
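One way to read the random-effects part of the output is through the intraclass correlation (ICC): the share of the total variance attributable to between-subject differences. Using the variance estimates printed above:

```r
#variance components from the random intercept model summary above
subject_var  <- 97.39  #between-subject (intercept) variance
residual_var <- 54.23  #within-subject (residual) variance

#intraclass correlation: proportion of total variance due to subjects
icc <- subject_var / (subject_var + residual_var)
round(icc, 3)
## [1] 0.642
```

Roughly 64% of the variance in the BPRS ratings lies between subjects, which underlines why the repeated-measures structure should not be ignored.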

Still looking for better models to represent the data, we create a random intercept and random slope model to fit the BPRSL data.

#Create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

#Print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 167.827  12.955        
##           week          2.331   1.527   -0.67
##  Residual              36.747   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.9830     2.6470  17.372
## week         -2.2704     0.2713  -8.370
## treatment2    1.5139     3.1392   0.482
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.545       
## treatment2 -0.593  0.000
#Perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRS_ref     5 2582.9 2602.3 -1286.5   2572.9                         
## BPRS_ref1    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimates of the fixed effects are once again quite similar to those of the previous model (BPRS_ref), but the likelihood ratio test of the random intercept model against the random intercept and slope model gives a chi-squared statistic of 63.66 on 2 degrees of freedom (DF) with a very small associated p-value (p = 1.499 × 10^-14). The random intercept and slope model therefore provides a considerably better fit for the data.
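The likelihood ratio test statistic in the ANOVA table can be reproduced by hand from the deviances and parameter counts of the two models (the slight difference from 63.663 is due to the rounded deviances printed above):

```r
#deviance difference and parameter-count difference from the ANOVA table
chisq <- 2572.9 - 2509.2
df_diff <- 7 - 5

#p-value from the chi-squared distribution (matches the ~1.5e-14 above)
p <- pchisq(chisq, df = df_diff, lower.tail = FALSE)
p
```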

Finally, we can fit a random intercept and slope model that allows for a treatment × week interaction.

#Create a random intercept and random slope model with the treatment x week interaction
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.5   2554.5  -1253.7   2507.5      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4747 -0.5256 -0.0866  0.4435  3.7884 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 164.204  12.814        
##           week          2.203   1.484   -0.66
##  Residual              36.748   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.9840  16.047
## week             -2.6283     0.3752  -7.006
## treatment2       -2.2911     4.2200  -0.543
## week:treatment2   0.7158     0.5306   1.349
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.668              
## treatment2  -0.707  0.473       
## wek:trtmnt2  0.473 -0.707 -0.668
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1    7 2523.2 2550.4 -1254.6   2509.2                    
## BPRS_ref2    8 2523.5 2554.6 -1253.7   2507.5  1.78  1     0.1821

The likelihood ratio test of the interaction random intercept and slope model against the corresponding model without the interaction gives a chi-squared statistic of 1.78 on 1 DF; the associated p-value (0.182) is fairly small but does not reach the conventional significance level (0.05). The evidence for the interaction is therefore weak, but since the interaction model fits the data slightly better, we choose it as our final model.

The estimated regression parameter for the interaction indicates that the BPRS slopes (growth rates) are higher for subjects in treatment 2 than in treatment 1: on average 0.72 points per week higher, with an approximate 95% confidence interval (CI) of [-0.35, 1.78]. Because this CI contains 0, the difference is not statistically significant.
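This approximate 95% CI can be reproduced from the fixed-effects output of the interaction model above, using the estimate ± 2 standard errors:

```r
#week:treatment2 estimate and standard error from the model summary above
est <- 0.7158
se  <- 0.5306

#approximate 95% confidence interval: estimate +/- 2 * standard error
ci <- c(est - 2 * se, est + 2 * se)
round(ci, 2)
## [1] -0.35  1.78
```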

Visualizing observed vs fitted values

Next we can evaluate how well our model fits the BPRSL data by finding the fitted values from the interaction model and plotting the fitted BPRS values against the observed BPRS values.

#Draw the plot of BPRSL with the observed bprs values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 10, 1)) +
  scale_y_continuous(name = "Observed BPRS") +
  theme(legend.position = "top")

#Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)

#Create a new column Fitted to BPRSL
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)

#Draw the plot of BPRSL with the Fitted values of BPRS
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 10, 1)) +
  scale_y_continuous(name = "Fitted BPRS") +
  theme(legend.position = "top")

As we can see from the plots, the interaction model fits the observed data quite nicely, although the fitted profiles are necessarily smoother and show less variability than the observed ones.